Regular Expressions

Regular expressions (regexes or re’s) constitute an extremely powerful, flexible and concise language for matching elements in text ranging from a few characters to complex patterns. While mastering the syntax of the regular expression language does require climbing a learning curve, this learning curve is not particularly steep, and a newcomer can find herself performing useful tasks with regular expressions almost immediately. Efforts spent learning regular expressions quickly pay off--tasks that are well suited for regular expressions abound. Indeed, regular expressions are one of the most useful computer skills, and an absolutely critical tool for data scientists.

This document will present basic regular expression syntax and cover common use cases for regular expressions: pattern matching, filtering, data extraction, and string replacement. We will present examples using python’s standard re regular expression library. All regular expressions presented below will be so-called "perl" regular expressions, named after the programming language that made them popular. This regular expression syntax is now very common, being use in python and many other regular expression systems. This syntax is also available in grep using the -P flag [1], and the grep alternative ack. The input to a regular expression can be just about any kind of text imaginable, from long documents to values of a variable in a particular operation in your program.

[1]: Though apparently not after OS X mountain lion, however, i believe sudo port install grep will fix the problem.

Regular Expression Basics

The language used by regular expressions includes both ordinary and special characters. Upper and lower case letters and all digits (the alphanumeric characters) are all treated literally--if you search for "foo", you will match "food", "fool", "foot" and "foosball", all strings containing "foo". The simplest regular expressions are just searches for a plain alphanumeric "needles" in some text "haystack".

Many punctuation characters have special meanings in regular expressions. If you wish to match the punctuation character literally, you must escape it with a "\". For instance, if you want to match a period (a special symbol, as we will see) you would look for "\." Below are some of the special characters used in regular expressions. Regular expression syntax is actually pretty vast, way too extensive to cover here. Instead, we present some of the most useful and common characters. Links to more information will be provided below.

  • +: Match one or more of the preceding character or group.
  • *: Match zero or more of the preceding character or group.
  • ?: Match zero or one of the preceding character or group.
  • *?, +?, ??: ordinarily, *, + and ? are greedy, matching the longest possible string that satisfies the regular expression. Adding the ? to any of these makes it non-greedy, instead matching the shortest possible expression.
  • .: Matches any character except newline.
  • $: Matches the end of a string (or before a newline character that ends a string). In re's multiline mode, matches the end of each line.
  • ^: Matches the beginning of a string (or after a newline character that ends a string). In re's multiline mode, matches the start of each line.
  • {k}: matches exactly k occurrences of the preceding character or group.
  • {j,k}: greedily match between j or k occurrences of the preceding character or group. (use ? to use non-greedy matching).
  • {j,}: greedily match for at least j occurrences of the preceding character or group. (use ? to use non-greedy matching).
  • |: used to denote or. The regular expression a|b will look for either a or b. This works when a and b are any character or group.
  • ( ): Parentheses are used to denote a group. The regular expression in the parentheses can be used in combination with any of the special characters above (and many not listed here). The text that is matched by this group is captured, usable in subsequent steps in the regular expression or available to the program outside the regex.
  • (?: ): A non-capturing group. This works just as above, but doesn’t hold on to the matched contents.
  • (?<=x): Matches any string that is preceded by x (an arbitrary regular expression).
  • [ ]: square braces are used to indicate a set of characters. Any character in this set can be matched. For instance, [abc] can match either a, b, or c. Combined with modifiers above, it can match combinations of many of these three. Ranges of characters can also be specified in these sets. For instance, [A-Z] matches all uppercase letters, [a-z] matches the lowercase, and [0-9] matches the digits. These can be combined: [A-Za-z0-9]. A character set can be negated by leading with a ^ character. for instance [^abc] will match anything but a, b, or c. Note that most special characters lose their meaning in these groups, they are treated literally.
  • \d: matches the digits, 0-9.
  • \D: matches anything but \d.
  • \w: matches any alphanumeric character plus underscore: [A-Za-z0-9_].
  • \W: matches anything but \w.
  • \s: matches any whitespace character: [ \t\n\r\f\v].
  • \S: matches anything but \s.
  • \b: matches the breaks between alphanumeric and non-alphanumeric characters (an empty string), the boundary between \w and \W. Useful for ensuring that what you match is actually a word.
  • \B: matches anything but \b. Useful for ensuring your match is in the middle of a word.

These special characters combined with ordinary strings make up the building blocks of the regular expression syntax. In Python’s re library, these components are sequenced together into a string in the desired order to define the pattern being sought. Once a regular expression is defined, it is passed as the first argument in re’s match function:


In [1]:
# first gotta import the library
import re

needle = "ll"
haystack = "foosball"

result = re.search(needle, haystack)
print result


<_sre.SRE_Match object at 0x1040cb578>

For instance, take the following expression:


In [2]:
result = re.search("ab*c", haystack)

will match when haystack contains "ac", "abc", "abbbc", etc. Similarly:


In [3]:
result = re.search("ab+c", haystack)

will match when haystack contains "abc", "abbc", "abbbc", etc, but not "ac".

An alternative mechanism for building regular expressions and matching strings is provided by the re library. Here, you first compile a regular expression using:


In [4]:
regex = re.compile(needle)

Then you can use this regular expression to find matches:


In [5]:
result = regex.search(haystack)

The two mechanisms are the same; it’s up to the developer to decide which she prefers.

The behavior of re.match( ) can be modified by the inclusion of optional flags. These flags are member variables of the re package. For instance, if we want our regular expression to ignore the case (uppercase/lowercase) of characters when performing matches, you can use the re.IGNORECASE flag: result = re.search(“ab+c”, “AbC”, re.IGNORECASE) will successfully find a match. Formally, either version of the match function accepts an optional comma-separated list of flags as its final argument. For a full list of available flags, please see the re documentation.

Matching and Filtering Text

If the match function fails to find the pattern being sought, then it returns None, a value that will evaluate as False in a boolean comparison (for instance an if-statement). We can use this to restrict further processing to only those strings that satisfy a match. For instance:


In [3]:
def teach_a_class():
  print "yay"
def take_time_off():
  print "zzzz"
    
def take_class(teacher_name):
  if re.search("^josh(ua)?\b?", teacher_name):
    teach_a_class()
  else:
    take_time_off()

take_class("josh")
take_class("joe")


yay
zzzz

Will evaluate the teach_a_class() function only if the teacher_name variable starts with "josh" or "joshua". Otherwise the tame_time_off() function is evaluated.

Similarly, the return value of a match can be used to filter out content that is not desired. Say for instance that we want to filter examples with missing values, represented by a "?"; we could use the following control statement based on a regular expression:


In [9]:
data = ["josh", "?", "mike", "joe"]

for name in data:
  if not re.search("\?", name):
    print name


josh
mike
joe

Here we will evaluate the print statement only if name does not contain any missing values, assuming all examples of missing values are represented by a "?".

Extracting Data -- where regex start to get really cool

In addition to simple matching and filtering, many regular expressions implementations, including python’s re, provide a powerful mechanism for extracting meaningful data from raw text. Through capturing, those strings that satisfy a particular regular expression are extracted from the text being matched, and returned to the program processing the raw data. The portion of regular expressions that should be captured is surrounded by parentheses, "( )". Then, provided the regular expression containing the capturing statement is satisfied, the result of the regular expression will contain a group of text matching patterns. This group method gets the results of the portions of the input text matched by the capturing statements in the regular expression. The matches are indexed from one-- to get the portion of the text matched by first capturing statement, you could query result.group(1), the second parentheses will have its match stored in result.group(2), etc. The value stored at result.group(0), is the entire portion of the input string matched by the regular expression, not just the portion satisfying the capturing parentheses.

As example of data extraction using capturing regular expressions, say we’re scanning some raw text for phone numbers that we wish to retain for later processing. We might try something like:


In [21]:
raw_text = """512-234-5234
foo
bar
124-512-5555
biz
125-555-5785"""

numbers = []

for line in raw_text.split("\n"):
  result = re.search("(\d{3}-\d{3}-\d{4})", line)
  if result:
    numbers.append(result.group(1))
print numbers


['512-234-5234', '124-512-5555', '125-555-5785']

After running this, the numbers array will contain any phone numbers in the raw text that were of the form ddd-ddd-dddd. Ok, you might think this a bit inflexible. Suppose we want to store area codes separately? We could then modify the above expression using a slightly more complex regular expression:


In [23]:
area_codes = []
numbers = []
for line in raw_text.split("\n"):
  result = re.search("\(?(\d{3})\)?[- ](\d{3}[- ]\d{4})", line)
  if result:
    area_codes.append(result.group(1))
    numbers.append(result.group(2))
print "area codes: %s" % area_codes
print "numbers: %s" % numbers


area codes: ['512', '124', '125']
numbers: ['234-5234', '512-5555', '555-5785']

Now, we will store area codes separately from the the local portion of the phone number. Perhaps more importantly, we’re able to capture a much wider set of phone number formats. The matching is agnostic to whether or not the area code is in parentheses, or if the individual stanzas of the phone number are separated by dashes. NOTE: We would now be ready to use area codes, for instance, to analyze where our users are located geographically.

What happens if there is more than one phone number per line? Python’s re library provides an additional matching function, findall(). This powerful function is used in the same manner as match(). However, instead of returning a results object, findall() returns an array of all non-overlapping groups of matching strings. Take the example below:


In [24]:
raw_text_inline = "(212)-555-5555 123 555 1234 456-789-1111"
results = re.findall("\(?(\d{3})\)?[- ](\d{3}[- ]\d{4})", raw_text_inline)

print results


[('212', '555-5555'), ('123', '555 1234'), ('456', '789-1111')]

This method is invaluable when you suspect that there may be more than one valid match in the body of text being considered. Note that the different groups in each match are stored in an array of tuples. Tuples can be though of as data structure with many of the same properties as a list, for instance, indexing:


In [25]:
for matches in results:
    print matches[0] # prints the area codes


212
123
456

The examples will look like gobbledygook at first. But after you go through some actual cases, and especially after you struggle to write a few for a real data science task, you will realize that you're not in Kansas any longer. Now get ready for a horse of a different color...

String Replacement

In addition to matching and extraction, regular expressions can be used to change data--especially unstructured text--in very powerful ways. In particular, regex allow you to find specific patterns and then replace them with specified strings.

As a data scientist, this is useful when trying to get data formated correctly as input to a specific system, such as when doing data cleanup, In python’s re library, the function used for replacement is sub() (think "substitute"). The pattern for invoking sub() is re.sub(needle, replacement, haystack). This will return a version of haystack where all instances of needle have been substituted with replacement. Imagine we want to conceal all phone numbers in a document. We could use the following call to sub():


In [26]:
raw_text_inline = "(212)-555-5555 123 555 1234 456-789-1111"
re.sub("\(?\d{3}\)?[- ]\d{3}[- ]\d{4}", "XXX-XXX-XXXX", raw_text_inline)


Out[26]:
'XXX-XXX-XXXX XXX-XXX-XXXX XXX-XXX-XXXX'

When performing substitution, matches found using the capturing mechanism are available to the replacement using \1, \2, and so on, as shortcuts to group(1), group(2), etc. In order to use this back-referencing capability, we need to tell the sub() mechanism to treat the replacement string as a regular expression. We do this by writing it with r' '. For instance, if we want to make sure all area codes are surrounded by parentheses, we can use:


In [27]:
re.sub("\(?(\d{3})\)?[- ](\d{3}[- ]\d{4})", r"(\1) \2", raw_text_inline) # Note the use of r'(\1) \2'


Out[27]:
'(212) 555-5555 (123) 555 1234 (456) 789-1111'

In [ ]: